Commodity Flow Survey Exploration by Mike Gysel

Summary of 2012 Commodity Flow Survey

The 2012 Commodity Flow Survey (CFS) is a joint effort by the Bureau of Transportation Statistics and the U.S. Census Bureau, U.S. Department of Commerce. The survey is the primary source of national and state-level data on domestic freight shipments by establishments in mining, manufacturing, whole sale auxiliaries, and selected retail and services trade industries, located in the 50 states and the District of Columbia. The survey produces estimates on the type, origin and destination, value, weight, modes of transportation, distance shipped, and ton-miles of commodities shipped.

Additional information on the Commodity Flow Survey can be found at https://catalog.data.gov/dataset/commodity-flow-survey

Univariate Plots Section

## 'data.frame':    4547661 obs. of  20 variables:
##  $ SHIPMT_ID         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ORIG_STATE        : int  25 42 26 20 12 24 19 6 19 13 ...
##  $ ORIG_MA           : int  148 428 220 556 99999 47900 99999 41740 99999 99999 ...
##  $ ORIG_CFS_AREA     : Factor w/ 169 levels "00-00000","01-00000",..: 73 128 76 56 35 70 53 15 53 39 ...
##  $ DEST_STATE        : int  25 6 47 20 12 30 19 6 19 13 ...
##  $ DEST_MA           : int  148 41740 314 556 99999 99999 99999 99999 99999 99999 ...
##  $ DEST_CFS_AREA     : Factor w/ 132 levels "01-142","01-380",..: 56 11 106 43 27 67 41 14 41 30 ...
##  $ NAICS             : int  333 311 322 323 4235 337 337 4239 327 4237 ...
##  $ QUARTER           : int  2 3 2 1 3 2 3 1 3 2 ...
##  $ SCTG              : Factor w/ 52 levels "00","01","01-05",..: 42 42 33 35 40 49 32 49 37 41 ...
##  $ MODE              : int  14 14 4 4 5 4 5 14 5 5 ...
##  $ SHIPMT_VALUE      : int  2178 344 4197 116 388 3716 43738 77 338 145 ...
##  $ SHIPMT_WGHT       : int  11 11 5134 6 527 1132 13501 4 12826 22 ...
##  $ SHIPMT_DIST_GC    : int  14 2344 470 3 124 1942 16 272 10 21 ...
##  $ SHIPMT_DIST_ROUTED: int  17 2734 579 3 201 2265 22 306 15 28 ...
##  $ TEMP_CNTL_YN      : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ EXPORT_YN         : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 1 1 1 ...
##  $ EXPORT_CNTRY      : Factor w/ 4 levels "C","M","N","O": 3 3 3 3 3 1 3 3 3 3 ...
##  $ HAZMAT            : Factor w/ 3 levels "H","N","P": 2 2 2 2 2 2 2 2 2 2 ...
##  $ WGT_FACTOR        : num  208.5 193.3 51.2 238.7 398.1 ...
##    SHIPMT_ID         ORIG_STATE       ORIG_MA       ORIG_CFS_AREA    
##  Min.   :      1   Min.   : 0.00   Min.   :    0   06-348  : 146728  
##  1st Qu.:1136916   1st Qu.:17.00   1st Qu.:  312   17-176  : 104572  
##  Median :2273831   Median :29.00   Median :  476   27-378  :  88526  
##  Mean   :2273831   Mean   :29.11   Mean   :38440   55-99999:  88185  
##  3rd Qu.:3410746   3rd Qu.:42.00   3rd Qu.:99999   42-99999:  87779  
##  Max.   :4547661   Max.   :56.00   Max.   :99999   06-488  :  81383  
##                                                    (Other) :3950488  
##    DEST_STATE       DEST_MA       DEST_CFS_AREA         NAICS       
##  Min.   : 1.00   Min.   :  104   06-348  : 162548   Min.   :   212  
##  1st Qu.:17.00   1st Qu.:  332   17-176  : 116536   1st Qu.:   326  
##  Median :29.00   Median :  488   36-408  : 107485   Median :   339  
##  Mean   :28.88   Mean   :40368   06-488  :  91410   Mean   :  7130  
##  3rd Qu.:42.00   3rd Qu.:99999   48-99999:  87025   3rd Qu.:  4241  
##  Max.   :56.00   Max.   :99999   42-99999:  86783   Max.   :551114  
##                                  (Other) :3895874                   
##     QUARTER           SCTG              MODE          SHIPMT_VALUE      
##  Min.   :1.000   35     : 318586   Min.   :  0.000   Min.   :        1  
##  1st Qu.:1.000   24     : 288078   1st Qu.:  4.000   1st Qu.:      176  
##  Median :2.000   43     : 283551   Median :  5.000   Median :      875  
##  Mean   :2.446   34     : 265539   Mean   :  7.117   Mean   :    18280  
##  3rd Qu.:3.000   40     : 264089   3rd Qu.: 14.000   3rd Qu.:     6026  
##  Max.   :4.000   33     : 237139   Max.   :101.000   Max.   :521277800  
##                  (Other):2890679                                        
##   SHIPMT_WGHT        SHIPMT_DIST_GC   SHIPMT_DIST_ROUTED TEMP_CNTL_YN
##  Min.   :        1   Min.   :   1.0   Min.   :   1       N:4233061   
##  1st Qu.:       15   1st Qu.:  17.0   1st Qu.:  21       Y: 314600   
##  Median :      270   Median : 110.0   Median : 140                   
##  Mean   :    37587   Mean   : 386.1   Mean   : 475                   
##  3rd Qu.:     8077   3rd Qu.: 538.0   3rd Qu.: 685                   
##  Max.   :277029600   Max.   :5210.0   Max.   :6932                   
##                                                                      
##  EXPORT_YN   EXPORT_CNTRY HAZMAT        WGT_FACTOR      
##  N:4361940   C:  54539    H: 169457   Min.   :     0.2  
##  Y: 185721   M:  19367    N:4244682   1st Qu.:    93.0  
##              N:4361940    P: 133522   Median :   289.2  
##              O: 111815                Mean   :  2311.0  
##                                       3rd Qu.:   992.7  
##                                       Max.   :975000.0  
## 

There are 4,547,661 observations of 20 variables.

Some interesting initial findings: * The median shipment value is $875 and the mean shipment value is $18,280, which is dramatically higher due to the max shipment value of $521,277,800. What could have been shipped for over half a billion dollars?!! * The median shipment weight is 270 pounds and the mean shipment weight is 37,587 pounds, which is dramatically higher due to the maximum shipment weight of 277,029,600 pounds!! What could have weighed 277 million pounds?! * The median shipment distance is 140 miles and the mean shipment distance is 475 miles, with a maximum shipment distance of 6,932 miles. This shows that the majority of shipments were made in the US, with some made abroad, which takes us to * 23 times as many shipments were not exported (4,361,940 shipments) as were exported (185,721 shipments).

Univariate Analysis

Origin State

The states with the most outgoing shipments are California, Texas, Ohio, Pennsylvania, and New York. The states with the least outgoing shipments are Alaska, Wyoming, Delaware, New Mexico, and Vermont.

I would like to check out a histogram of the number of shipments for each origin state.

The distribution of number of outgoing shipments from each state is a right-skewed, long tail distribution.

Destination State

The states with the most incoming shipments are California, Texas, New York, Florida, and Ohio. The states with the least incoming shipments are Rhode Island, Wyoming, Alaska, Vermont, and Delaware.

I would like to check out a histogram of the number of shipments for each destination state.

The distribution of number of incoming shipments to each state, similar to outgoing shipments from each state, is a right-skewed, long tail distribution.

Metro Area

## # A tibble: 72 x 2
##    ORIG_MA       n
##      <int>   <int>
##  1   99999 1630669
##  2     408  208768
##  3     348  146728
##  4     176  127993
##  5     148  124844
##  6     428  109581
##  7     378   88526
##  8     488   81383
##  9     122   79936
## 10     206   73944
## # ... with 62 more rows
## # A tibble: 71 x 2
##    DEST_MA       n
##      <int>   <int>
##  1   99999 1706285
##  2     408  236653
##  3     348  162548
##  4     176  131130
##  5     148  114853
##  6     428   94747
##  7     488   91410
##  8     206   83248
##  9     288   79953
## 10     122   77286
## # ... with 61 more rows

The top 5 origin and destination metro areas with the most outgoing/incoming shipments are the same: New York-Newark, Los Angeles-Long Beach, Chicago-Naperville, Boston-Worcester-Providence, and Philadelphia-Reading-Camden. Note: MA 99999 represents the remainder of numerous different areas, and was not counted here.

I would like to check out a histogram of the number of shipments for each origin and destination metro area.

The distribution of number of incoming shipments to each metro area and the distribution of number of outgoing shipments from each metro area both display a right-skewed, long tail distribution. This is what we would expect, given the distribution of shipment quantity for each state.

NAICS (North American Industry Classification System) Codes

## # A tibble: 45 x 2
##    NAICS      n
##    <int>  <int>
##  1   325 221722
##  2   332 209430
##  3  4238 199770
##  4   311 186452
##  5  4244 175812
##  6   326 151976
##  7   333 147874
##  8   212 140933
##  9   323 139305
## 10  4236 138760
## # ... with 35 more rows

The industries with the most shipments are the Chemical Manufacturing Sector; Aerospace Sector; Machinery, Equipment, and Supplies Merchant Wholesalers; Food Manufacturing; and Grocery and Related Product Merchant Wholesalers. It appears America likes to ship chemicals, machines, and food!!

I would also like to view the distribution of shipments per industry.

The distribution of shipments for each industry appears relatively symmetric, with a bit of a tail to the right.

SCTG (Standard Classification of Transported Goods) Codes

## # A tibble: 52 x 2
##    SCTG       n
##    <fct>  <int>
##  1 35    318586
##  2 24    288078
##  3 43    283551
##  4 34    265539
##  5 40    264089
##  6 33    237139
##  7 30    213796
##  8 29    200529
##  9 32    187159
## 10 07    183021
## # ... with 42 more rows

The industries with the most shipments are Electronic and Other Electrical Equipment and Components, and Office Equipment; Plastics and Rubber; Mixed Freight; Machinery; and Miscellaneous Manufactured Products. These largely align with NAICS Codes, with Plastics and Rubber aligning with the Chemical Manufacturing Sector, and both Electrical Equipment and Machinery aligning with Machinery.

Quarter

# Determine the number of shipments for each quarter
arrange(count(cfs, QUARTER), desc(n))
## # A tibble: 4 x 2
##   QUARTER       n
##     <int>   <int>
## 1       1 1188440
## 2       2 1182850
## 3       3 1137376
## 4       4 1038995

The distribution of shipments is relatively even, with slightly fewer shipments each subsequent quarter.

Mode

The three modes of transit for each shipment with by far the most shipments are Private Truck; For-hire Truck; and Parcel, USPS, or Courier. This makes me wonder which modes are typically used for different attributes, such as distance, shipment value, shipment type, etc.

Shipment Value

The distribution of shipment value is a right-skewed, long tail distribution. This distribution makes me want to use a logarithmic scale on the x-axis.

As expected, the logarithmic transformed shipment value distribution is relatively symmetric.

Shipment Weight

The distribution of shipment weight is right-skewed with a peak less than 100 pounds, and a secondary peak around 3 million pounds. This distribution makes me want to use a logarithmic scale on the x-axis.

This logarithmic transformed shipment value distribution looks relatively flat with a low and high peak.

Shipment Distance Routed

The distribution of shipment distance routed is a right-skewed, long tail distribution. This distribution makes me want to use a logarithmic scale on the x-axis.

Unexpectedly, the logarithmic distribution is bimodal, with a peak around 10 pounds and a larger peak around 1,000 pounds. This makes me wonder what different attributes these two peaks have.

Export Country

## # A tibble: 4 x 2
##   EXPORT_CNTRY       n
##   <fct>          <int>
## 1 N            4361940
## 2 O             111815
## 3 C              54539
## 4 M              19367

As expected, most (95.9%) shipments were not exports. Of the shipments that were exports, 29.4% went to Canada, 10.4% went to Mexico, and 60.2% wentto other countries.

Hazmat

## # A tibble: 3 x 2
##   HAZMAT       n
##   <fct>    <int>
## 1 H       169457
## 2 N      4244682
## 3 P       133522

93.3% of shipments are Not Hazmat (good to hear), 2.9% are Class 3.0 Hazmat (flammable liquids), and 3.7% are Other Hazmat.

What is the structure of your dataset?

There are 4,547,661 shipments with 20 characteristics (shipment identifier, origin state, origin metro area, origin CFS area, destination state, destination metro area, destination CFS area, NAICS industry classification, quarter shipment occurred, SCTG commodity code, mode of transportation, shipment value, shipment weight, great circle distance, routed distance, temperature control, export shipment, export destination, hazardous material, and weighting factor).

What is/are the main feature(s) of interest in your dataset?

Some primary inquiries I have are: * What shipments weigh and cost so much? * How does mode of transportation vary with shipment distance and weight? * What are the differences in great circle distance and routed distance. How do these vary as distance increases? Does the ratio of great circle distance to routed distance increase with exports? * Shipment weights peak around 10 pounds and 1,000 pounds. What are the attributes of these two peaks? * What are attributes of packages we send abroad?

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think most of the features can potentially be useful with this broad range of inquiries above.

Did you create any new variables from existing variables in the dataset?

I counted the occurrences of different categorical variables, but did not create any new ones.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Per above, the distribution of shipment weights, with peaks around 10 pounds and 1,000 pounds is interesting.

Bivariate Plots Section

Most variables show no relation, but some interesting patterns: * Weak correlation between shipment weight and shipment value. * Temperature controlled shipments are of greater value and weight on average. * Export shipments are of greater value and weight on average. * Hazmat shipment categories have different average values and weights. * Distance routed differs by temperature control, export, and hazmat.

Shipment Weights - What shipments weigh so much?

Top 1% of Shipments by Weight

##    SHIPMT_ID         ORIG_STATE       ORIG_MA       ORIG_CFS_AREA  
##  Min.   :    195   Min.   : 1.00   Min.   :    0   48-288  : 2788  
##  1st Qu.:1128301   1st Qu.:19.00   1st Qu.:  350   19-99999: 1737  
##  Median :2286056   Median :29.00   Median :99999   56-99999: 1653  
##  Mean   :2278369   Mean   :30.72   Mean   :54799   48-99999: 1315  
##  3rd Qu.:3423168   3rd Qu.:48.00   3rd Qu.:99999   17-99999: 1086  
##  Max.   :4547643   Max.   :56.00   Max.   :99999   22-406  : 1082  
##                                                    (Other) :35816  
##    DEST_STATE       DEST_MA       DEST_CFS_AREA       NAICS       
##  Min.   : 1.00   Min.   :  104   48-288  : 2625   Min.   :   212  
##  1st Qu.:17.00   1st Qu.:  332   48-99999: 2163   1st Qu.:   311  
##  Median :29.00   Median :12940   22-406  : 1614   Median :   325  
##  Mean   :30.33   Mean   :45108   06-348  : 1393   Mean   :  5393  
##  3rd Qu.:47.00   3rd Qu.:99999   26-220  : 1118   3rd Qu.:   333  
##  Max.   :56.00   Max.   :99999   17-176  : 1053   Max.   :551114  
##                                  (Other) :35511                   
##     QUARTER           SCTG            MODE         SHIPMT_VALUE      
##  Min.   :1.000   20     : 4388   Min.   :  0.00   Min.   :        7  
##  1st Qu.:1.000   24     : 3597   1st Qu.:  6.00   1st Qu.:    20561  
##  Median :2.000   15     : 3300   Median :  6.00   Median :    53088  
##  Mean   :2.444   41     : 3011   Mean   :  7.67   Mean   :   598528  
##  3rd Qu.:3.000   02     : 2824   3rd Qu.:  8.00   3rd Qu.:   150672  
##  Max.   :4.000   12     : 2695   Max.   :101.00   Max.   :521277800  
##                  (Other):25662                                       
##   SHIPMT_WGHT        SHIPMT_DIST_GC   SHIPMT_DIST_ROUTED TEMP_CNTL_YN
##  Min.   :   170254   Min.   :   1.0   Min.   :   1.0     N:43936     
##  1st Qu.:   190928   1st Qu.:  46.0   1st Qu.:  64.0     Y: 1541     
##  Median :   212019   Median : 306.0   Median : 454.0                 
##  Mean   :  2819972   Mean   : 451.5   Mean   : 642.5                 
##  3rd Qu.:   580119   3rd Qu.: 730.0   3rd Qu.:1031.0                 
##  Max.   :277029600   Max.   :4210.0   Max.   :6791.0                 
##                                                                      
##  EXPORT_YN EXPORT_CNTRY HAZMAT      WGT_FACTOR     
##  N:41564   C: 1219      H: 3897   Min.   :    0.3  
##  Y: 3913   M:  936      N:37148   1st Qu.:   28.9  
##            N:41564      P: 4432   Median :   67.3  
##            O: 1758                Mean   :  170.2  
##                                   3rd Qu.:  160.2  
##                                   Max.   :25676.9  
## 

Some initial interesting findings: * Most are not temperature controlled * Most are not exported * Most are not Hazmat

What NAICS and SCTG Codes have shipments with the highest average weights?

## # A tibble: 45 x 2
##     NAICS     avg
##     <int>   <dbl>
##  1    212 588041.
##  2   4245 403757.
##  3    324 354552.
##  4 551114  58268.
##  5    325  41900.
##  6    327  38276.
##  7   4247  32626.
##  8   4239  30116.
##  9    321  28020.
## 10    311  27680.
## # ... with 35 more rows
## # A tibble: 52 x 2
##    SCTG       avg
##    <fct>    <dbl>
##  1 15    5596597.
##  2 15-19 4362761.
##  3 99    2812023.
##  4 01-05 2801830.
##  5 14    1683167.
##  6 10-14 1187371.
##  7 02     357706.
##  8 20-24  299220.
##  9 17     270357.
## 10 06-09  257195.
## # ... with 42 more rows

Some observations related to shipment weight: * The average shipment weight for Mining and Quarrying (except oil and gas) at 588,041 pounds, Farm Product Raw Material Merchant Wholesaler at 403,757 pounds, and Petroleum and Coal Product Manufacturing at 354,552 pounds far exceed the shipment weights for any other NAICS Codes. * The average shipment weight for Coal at 5,596,597 pounds and Metallic Ores and Concentrates at 1,683,167 pounds far exceed any other SCTG Codes.

What transportation modes are used for shipments with the highest weights?

##  [1] 5.465400e-01 1.074609e-01 8.947380e-02 7.617037e-02 6.047013e-02
##  [6] 4.538558e-02 3.177430e-02 1.750335e-02 5.541263e-03 3.672186e-03
## [11] 3.518262e-03 2.660686e-03 2.330849e-03 2.198914e-03 1.957033e-03
## [16] 1.605207e-03 9.675220e-04 7.256415e-04 4.397827e-05
##  [1] 0.5466138962 0.1578715919 0.1046613896 0.0543095866 0.0211081794
##  [6] 0.0167106420 0.0153913808 0.0151715040 0.0151715040 0.0138522427
## [11] 0.0083553210 0.0074758135 0.0072559367 0.0061565523 0.0054969217
## [16] 0.0039577836 0.0004397537

Some observations related to shipment weight and transportation mode: * For shipments in the top 1% of weight, 55% are transported by rail and 11% are transported by truck and rail. * For shipments in the top 0.1% of weight, 54% are transported by rail, 16% are transported by pipeline, and 10% are transported by Inland Water.

This makes me wonder how heavy export shipments are transported.

Are shipment weight and distance correlated?

## 
## Call:
## lm(formula = cfs_lm$SHIPMT_DIST_ROUTED ~ cfs_lm$SHIPMT_WGHT)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -552.2 -508.5 -326.4  255.0 6397.8 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        553.203032   0.429246  1288.8   <2e-16 ***
## cfs_lm$SHIPMT_WGHT  -0.040641   0.000214  -189.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 723.6 on 3469208 degrees of freedom
## Multiple R-squared:  0.01029,    Adjusted R-squared:  0.01029 
## F-statistic: 3.607e+04 on 1 and 3469208 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = cfs_lm$SHIPMT_DIST_ROUTED ~ cfs_lm$SHIPMT_WGHT + 
##     (log(cfs_lm$SHIPMT_WGHT)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -765.7 -447.6 -299.8  243.7 6474.8 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              7.667e+02  7.881e-01  972.75   <2e-16 ***
## cfs_lm$SHIPMT_WGHT       2.244e-02  2.883e-04   77.86   <2e-16 ***
## log(cfs_lm$SHIPMT_WGHT) -6.307e+01  1.965e-01 -320.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 713 on 3469207 degrees of freedom
## Multiple R-squared:  0.03884,    Adjusted R-squared:  0.03884 
## F-statistic: 7.009e+04 on 2 and 3469207 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = cfs_lm$SHIPMT_DIST_ROUTED ~ cfs_lm$SHIPMT_WGHT + 
##     (cfs_lm$SHIPMT_WGHT^2))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -552.2 -508.5 -326.4  255.0 6397.8 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        553.203032   0.429246  1288.8   <2e-16 ***
## cfs_lm$SHIPMT_WGHT  -0.040641   0.000214  -189.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 723.6 on 3469208 degrees of freedom
## Multiple R-squared:  0.01029,    Adjusted R-squared:  0.01029 
## F-statistic: 3.607e+04 on 1 and 3469208 DF,  p-value: < 2.2e-16

It appears there is an exponential relationship with shipment weight and routed shipment distance, when removing outliers above a 10,000 pound shipment weight. The linear, exponential, and quadratic models however, with very low correlation coefficients, say otherwise.

Are shipment weight and value correlated?

## 
## Call:
## lm(formula = cfs$SHIPMT_VALUE ~ cfs$SHIPMT_WGHT)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -32299110    -13717    -13220     -9436 521253061 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.382e+04  5.064e+02   27.28   <2e-16 ***
## cfs$SHIPMT_WGHT 1.187e-01  5.240e-04  226.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1079000 on 4547659 degrees of freedom
## Multiple R-squared:  0.01117,    Adjusted R-squared:  0.01117 
## F-statistic: 5.135e+04 on 1 and 4547659 DF,  p-value: < 2.2e-16

The bulk of the data does not show much of a correlation between shipment weight and value. The R-Squared value of only 0.01 supports this. This is surprising, as I would have expected very heavy shipments to be worth a lot more than very light shipments.

What is the median/average shipment weight for different exports?

## [1] 0.11518064 0.35142257 0.09774863 0.43564815

The median exports weight to Mexico dramatically exceeds the median export weights anywhere else. What is being exported to Mexico?

arrange(count(subset(cfs, EXPORT_CNTRY=='M'), NAICS), desc(n))
## # A tibble: 44 x 2
##    NAICS     n
##    <int> <int>
##  1   332  1811
##  2   325  1739
##  3   336  1653
##  4   326  1622
##  5   333  1394
##  6   331  1080
##  7   335   836
##  8  4236   635
##  9   334   628
## 10   322   627
## # ... with 34 more rows

The NAICS Codes with the most shipments to Mexico are Fabricated Metal Product Manufacturing, Chemical Manufacturing, Transportation Equipment Manufacturing, Plastics and Rubber Products Manufacturing, and Machinery Manufacturing.

Shipment Values - What shipments cost so much?

Top 1% of Shipments by Value

##    SHIPMT_ID         ORIG_STATE       ORIG_MA       ORIG_CFS_AREA  
##  Min.   :     28   Min.   : 1.00   Min.   :    0   48-288  : 1971  
##  1st Qu.:1132368   1st Qu.:18.00   1st Qu.:  288   06-348  : 1530  
##  Median :2263203   Median :32.00   Median :  482   26-220  :  950  
##  Mean   :2270108   Mean   :30.62   Mean   :40819   42-99999:  846  
##  3rd Qu.:3398338   3rd Qu.:45.00   3rd Qu.:99999   19-99999:  845  
##  Max.   :4547516   Max.   :56.00   Max.   :99999   56-99999:  833  
##                                                    (Other) :38502  
##    DEST_STATE       DEST_MA       DEST_CFS_AREA       NAICS       
##  Min.   : 1.00   Min.   :  104   48-288  : 2404   Min.   :   212  
##  1st Qu.:17.00   1st Qu.:  294   06-348  : 2341   1st Qu.:   325  
##  Median :29.00   Median :  430   22-406  : 1928   Median :   333  
##  Mean   :29.75   Mean   :33852   48-99999: 1310   Mean   : 10741  
##  3rd Qu.:45.00   3rd Qu.:99999   26-220  : 1153   3rd Qu.:  4238  
##  Max.   :56.00   Max.   :99999   17-176  : 1134   Max.   :551114  
##                                  (Other) :35207                   
##     QUARTER           SCTG            MODE          SHIPMT_VALUE      
##  Min.   :1.000   34     : 5401   Min.   :  0.000   Min.   :   161552  
##  1st Qu.:1.000   36     : 3549   1st Qu.:  4.000   1st Qu.:   198120  
##  Median :2.000   35     : 3387   Median :  4.000   Median :   266118  
##  Mean   :2.432   30     : 2684   Mean   :  6.165   Mean   :  1070089  
##  3rd Qu.:3.000   43     : 2623   3rd Qu.:  6.000   3rd Qu.:   484339  
##  Max.   :4.000   21     : 2426   Max.   :101.000   Max.   :521277800  
##                  (Other):25407                                        
##   SHIPMT_WGHT        SHIPMT_DIST_GC   SHIPMT_DIST_ROUTED TEMP_CNTL_YN
##  Min.   :        1   Min.   :   1.0   Min.   :   1.0     N:42292     
##  1st Qu.:    13462   1st Qu.:  61.0   1st Qu.:  84.0     Y: 3185     
##  Median :    35224   Median : 322.0   Median : 445.0                 
##  Mean   :  2410927   Mean   : 522.7   Mean   : 679.2                 
##  3rd Qu.:   146578   3rd Qu.: 788.0   3rd Qu.:1041.0                 
##  Max.   :277029600   Max.   :5012.0   Max.   :6791.0                 
##                                                                      
##  EXPORT_YN EXPORT_CNTRY HAZMAT      WGT_FACTOR     
##  N:37830   C: 1279      H: 2192   Min.   :    0.2  
##  Y: 7647   M:  742      N:40824   1st Qu.:   18.2  
##            N:37830      P: 2461   Median :   39.8  
##            O: 5626                Mean   :  111.0  
##                                   3rd Qu.:  104.5  
##                                   Max.   :25676.9  
## 

Some initial interesting findings: * Most are not temperature controlled * Most are not exported * Most are not Hazmat

What NAICS and SCTG Codes have shipments with the highest average values?

## # A tibble: 45 x 2
##     NAICS     avg
##     <int>   <dbl>
##  1    336 192388.
##  2    324 138415.
##  3   4245  81859.
##  4    325  43281.
##  5   4242  38882.
##  6   4931  37794.
##  7    331  34882.
##  8 551114  34486.
##  9    311  24577.
## 10    333  23431.
## # ... with 35 more rows
## # A tibble: 52 x 2
##    SCTG       avg
##    <fct>    <dbl>
##  1 15-19 1380790.
##  2 35-38 1284831.
##  3 00     835268.
##  4 01-05  585695.
##  5 99     355346.
##  6 37     251358.
##  7 20-24  169013.
##  8 14     145027.
##  9 17     120425.
## 10 10-14  117273.
## # ... with 42 more rows

Some observations related to shipment value: * The average shipment value for Transportation Equipment Manufacturing at $192,388, Petroleum and Coal Product Manufacturing at $138,415, and Farm Product Raw Material Merchant Wholesalers at $81,859 far exceed the average shipment values for any other NAICS Code. * The average shipment value for Transportation Equipment, not esewhere classified at $355,346; Metallic Ores and Concentrates at $145,027; and Gasoline, Aviation Turbine Fuel, and Ethanol (includes Kerosene, and Fuel Alcohols) have the greatest average shipment values among SCTG Codes. It appears transportation shipments cost a lot!

What transportation modes are used for shipments with the highest values?

##  [1] 0.5637355147 0.1131561009 0.1068012402 0.0463311124 0.0439342965
##  [6] 0.0315544121 0.0287178134 0.0226708006 0.0183829188 0.0082899048
## [11] 0.0054313169 0.0031004684 0.0015172505 0.0013853157 0.0013853157
## [16] 0.0012533808 0.0008575764 0.0008575764 0.0004177936 0.0002198914
##  [1] 0.3120052770 0.2198768690 0.0976253298 0.0952066843 0.0732189974
##  [6] 0.0690413369 0.0457343887 0.0175901495 0.0173702726 0.0147317502
## [11] 0.0096745822 0.0085751979 0.0074758135 0.0048372911 0.0026385224
## [16] 0.0017590150 0.0013192612 0.0006596306 0.0004397537 0.0002198769

Some observations related to shipment value and transportation mode: * For shipments in the top 1% of value, 56% are transported by For-hire Truck, 11% are transported by Rail, and another 11% are transported by Private Truck. * For shipments in the top 0.1% of value, 41% are transported by For-Hire Truck and 22% are transported by Pipeline.

What is the median/average shipment value for different exports?

## # A tibble: 4 x 5
##   EXPORT_CNTRY median     avg median_perc avg_perc
##   <fct>         <dbl>   <dbl>       <dbl>    <dbl>
## 1 C             3914.  25133.      0.191    0.128 
## 2 M             9135.  54454.      0.445    0.276 
## 3 N              814.  15897.      0.0396   0.0807
## 4 O             6679. 101602.      0.325    0.516

The median exports value to each country is relatively similar, but the mean exports value to Other countries (not Mexico or Canada) is significantly higher than other exports. What NAICS/SCTG Codes are being exported to Other Countries, for exports with the highest values?

## # A tibble: 44 x 2
##     NAICS      avg
##     <int>    <dbl>
##  1   4245 1073205.
##  2    336  912366.
##  3    324  865211.
##  4    212  313935.
##  5   4247  172265.
##  6    312  104866.
##  7    325   94822.
##  8 551114   85187.
##  9   4931   75392.
## 10    331   72618.
## # ... with 34 more rows
## # A tibble: 51 x 2
##    SCTG        avg
##    <fct>     <dbl>
##  1 18    19860223.
##  2 15-19  4164445.
##  3 01-05  3301144.
##  4 17     2265182.
##  5 02     1569041.
##  6 15      788179.
##  7 35-38   585661.
##  8 37      469188.
##  9 36      394739.
## 10 03      341148.
## # ... with 41 more rows

The industries with the most value shipped to Other countries are: * NAICS Codes: Farm Product Raw Material Merchant Wholesalers, Transportation Equipment Manufacturing, and Petroleum and Coal Manufacturing. * SCTG Categories: Fuel Oils (includes Diesel, Bunker C, and Biodiesel); Gasoline, Aviation Turbine Fuel, and Ethanol (includes Kerosene, and Fuel Alcohols); and Cereal Grains (includes seed).

The industries with the most value being shipped, regardless of how the data is subset, appear to all be related to Agriculture, Transportation, and Energy.

What mode of transit is used for different distances?

Air (incl truck & air), Truck and rail, and Truck and water are the three modes of transit with the highest mean shipment distance. Air (incl truck & air), and Truck and rail were the modes of transit with the highest median shipment distance.

What are the differences in great circle distance and routed distance.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0012   1.1429   1.2128   1.3955   1.3138 839.0000

The median and mean ratio between great circle distance and routed distance are 1.2128 and 1.3955, respectively.

## 
## Call:
## lm(formula = cfs$dist_ratio ~ cfs$SHIPMT_DIST_GC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1.14  -0.28  -0.17  -0.04 837.53 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.470e+00  2.608e-03  563.62   <2e-16 ***
## cfs$SHIPMT_DIST_GC -1.933e-04  3.768e-06  -51.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.617 on 4547659 degrees of freedom
## Multiple R-squared:  0.0005782,  Adjusted R-squared:  0.000578 
## F-statistic:  2631 on 1 and 4547659 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = cfs$dist_ratio ~ cfs$SHIPMT_DIST_GC + (log(cfs$SHIPMT_DIST_GC)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1.20  -0.27  -0.11  -0.01 837.25 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.824e+00  6.562e-03  278.01   <2e-16 ***
## cfs$SHIPMT_DIST_GC       7.475e-05  5.912e-06   12.64   <2e-16 ***
## log(cfs$SHIPMT_DIST_GC) -1.008e-01  1.714e-03  -58.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.615 on 4547658 degrees of freedom
## Multiple R-squared:  0.001338,   Adjusted R-squared:  0.001337 
## F-statistic:  3046 on 2 and 4547658 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = cfs$dist_ratio ~ cfs$SHIPMT_DIST_GC + (cfs$SHIPMT_DIST_GC^2))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1.14  -0.28  -0.17  -0.04 837.53 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.470e+00  2.608e-03  563.62   <2e-16 ***
## cfs$SHIPMT_DIST_GC -1.933e-04  3.768e-06  -51.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.617 on 4547659 degrees of freedom
## Multiple R-squared:  0.0005782,  Adjusted R-squared:  0.000578 
## F-statistic:  2631 on 1 and 4547659 DF,  p-value: < 2.2e-16

It appears there is an exponential relationship with great circle shipment distance and distance ratio. The linear, exponential, and quadratic models however, each with low correlation coefficients, say otherwise.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

There are few correlations between quantitative variables in the dataset, but the most interesting findings are: * The shipments with the most weight and value tend to be in the agriculture, transportation, and energy industries. * The heaviest shipments are transported mainly by rail, while the average shipments are transported mainly by road. * Air, water, and rail are the modes of transit used for the longest shipment distances. * The average weight of shipments exported to Mexico far exceed shipments exported elsewhere; the average value of shipments exported to other countries far exceed shipments exported to Mexico or Canada.

What was the strongest relationship you found?

The correlation coefficient between shipment weight and shipment value was 0.33, per the pairplot.

Multivariate Plots Section

Shipment distance routed had a peak around 10 miles and a peak around 1000
miles. What are the different attributes of these peaks?

##    SHIPMT_ID         ORIG_STATE       ORIG_MA      ORIG_CFS_AREA    
##  Min.   :      1   Min.   : 0.00   Min.   :    0   06-348 :  55447  
##  1st Qu.:1134824   1st Qu.:16.00   1st Qu.:  314   17-176 :  33094  
##  Median :2273367   Median :29.00   Median :  476   06-488 :  32424  
##  Mean   :2273371   Mean   :28.68   Mean   :36101   36-408 :  31162  
##  3rd Qu.:3410118   3rd Qu.:42.00   3rd Qu.:99999   34-408 :  30823  
##  Max.   :4547661   Max.   :56.00   Max.   :99999   25-148 :  29408  
##                                                    (Other):1444921  
##    DEST_STATE       DEST_MA      DEST_CFS_AREA         NAICS       
##  Min.   : 1.00   Min.   :  104   06-348 :  55626   Min.   :   212  
##  1st Qu.:16.00   1st Qu.:  314   36-408 :  36617   1st Qu.:   327  
##  Median :29.00   Median :  476   17-176 :  34482   Median :  4236  
##  Mean   :28.69   Mean   :36610   06-488 :  31856   Mean   :  8810  
##  3rd Qu.:42.00   3rd Qu.:99999   25-148 :  31429   3rd Qu.:  4244  
##  Max.   :56.00   Max.   :99999   48-288 :  29084   Max.   :551114  
##                                  (Other):1438185                   
##     QUARTER          SCTG              MODE          SHIPMT_VALUE      
##  Min.   :1.00   43     : 122804   Min.   :  0.000   Min.   :        1  
##  1st Qu.:1.00   19     : 103513   1st Qu.:  5.000   1st Qu.:      138  
##  Median :2.00   31     :  89503   Median :  5.000   Median :      490  
##  Mean   :2.46   35     :  84115   Mean   :  5.476   Mean   :    16614  
##  3rd Qu.:3.00   24     :  83471   3rd Qu.:  5.000   3rd Qu.:     2431  
##  Max.   :4.00   07     :  82533   Max.   :101.000   Max.   :521277800  
##                 (Other):1091340                                        
##   SHIPMT_WGHT        SHIPMT_DIST_GC   SHIPMT_DIST_ROUTED TEMP_CNTL_YN
##  Min.   :        1   Min.   :   1.0   Min.   : 1.00      N:1536312   
##  1st Qu.:       29   1st Qu.:   6.0   1st Qu.: 7.00      Y: 120967   
##  Median :      407   Median :  11.0   Median :13.00                  
##  Mean   :    31098   Mean   :  13.8   Mean   :17.05                  
##  3rd Qu.:     8381   3rd Qu.:  20.0   3rd Qu.:25.00                  
##  Max.   :215926090   Max.   :1690.0   Max.   :49.00                  
##                                                                      
##  EXPORT_YN   EXPORT_CNTRY HAZMAT        WGT_FACTOR      
##  N:1640663   C:   1173    H: 105930   Min.   :     0.3  
##  Y:  16616   M:    941    N:1458485   1st Qu.:   137.4  
##              N:1640663    P:  92864   Median :   406.6  
##              O:  14502                Mean   :  2118.4  
##                                       3rd Qu.:  1232.3  
##                                       Max.   :975000.0  
##                                                         
##    dist_ratio      
##  Min.   : 0.00118  
##  1st Qu.: 1.00000  
##  Median : 1.21429  
##  Mean   : 1.24275  
##  3rd Qu.: 1.35000  
##  Max.   :44.00000  
## 
##    SHIPMT_ID         ORIG_STATE       ORIG_MA       ORIG_CFS_AREA   
##  Min.   :     19   Min.   : 0.00   Min.   :    0   27-378  : 22033  
##  1st Qu.:1135294   1st Qu.:17.00   1st Qu.:  312   06-348  : 21761  
##  Median :2270721   Median :29.00   Median :  464   48-206  : 15130  
##  Mean   :2273187   Mean   :29.12   Mean   :38056   55-99999: 13843  
##  3rd Qu.:3409801   3rd Qu.:42.00   3rd Qu.:99999   17-176  : 11512  
##  Max.   :4547655   Max.   :56.00   Max.   :99999   19-99999: 10851  
##                                                    (Other) :418671  
##    DEST_STATE       DEST_MA       DEST_CFS_AREA        NAICS       
##  Min.   : 1.00   Min.   :  104   48-99999: 25321   Min.   :   212  
##  1st Qu.:12.00   1st Qu.:  312   06-348  : 25318   1st Qu.:   326  
##  Median :30.00   Median :  476   48-206  : 24146   Median :   334  
##  Mean   :29.43   Mean   :37106   48-288  : 22352   Mean   :  5951  
##  3rd Qu.:48.00   3rd Qu.:99999   12-370  : 17319   3rd Qu.:  4236  
##  Max.   :56.00   Max.   :99999   08-216  : 12418   Max.   :551114  
##                                  (Other) :386927                   
##     QUARTER          SCTG             MODE          SHIPMT_VALUE      
##  Min.   :1.00   35     : 61620   Min.   :  0.000   Min.   :        1  
##  1st Qu.:1.00   40     : 54726   1st Qu.:  4.000   1st Qu.:      156  
##  Median :2.00   30     : 41102   Median : 14.000   Median :      937  
##  Mean   :2.43   34     : 39746   Mean   :  9.534   Mean   :    21547  
##  3rd Qu.:3.00   24     : 36761   3rd Qu.: 14.000   3rd Qu.:     7616  
##  Max.   :4.00   38     : 32056   Max.   :101.000   Max.   :521277800  
##                 (Other):247790                                        
##   SHIPMT_WGHT        SHIPMT_DIST_GC SHIPMT_DIST_ROUTED TEMP_CNTL_YN
##  Min.   :        1   Min.   :   2   Min.   :1001       N:491357    
##  1st Qu.:        5   1st Qu.: 928   1st Qu.:1151       Y: 22444    
##  Median :       39   Median :1077   Median :1329                   
##  Mean   :    67678   Mean   :1117   Mean   :1386                   
##  3rd Qu.:     1158   3rd Qu.:1315   3rd Qu.:1593                   
##  Max.   :138300770   Max.   :2345   Max.   :1999                   
##                                                                    
##  EXPORT_YN  EXPORT_CNTRY HAZMAT       WGT_FACTOR         dist_ratio      
##  N:459032   C: 14964     H:  7915   Min.   :     0.2   Min.   :  0.5835  
##  Y: 54769   M: 10544     N:501340   1st Qu.:    73.7   1st Qu.:  1.1611  
##             N:459032     P:  4546   Median :   232.0   Median :  1.2044  
##             O: 29261                Mean   :  3014.4   Mean   :  1.8746  
##                                     3rd Qu.:   874.0   3rd Qu.:  1.2732  
##                                     Max.   :975000.0   Max.   :839.0000  
## 

The low and high peak for shipment distances appear to differ in shipment weight and shipment value.

The distribution of shipment weight for the low-distance (0 to 50 miles) and high-distance (1,000 to 2,000 miles) shipments appears to be very similar. It seems like this graph would be better if the count were a proportion of total count for each distance category, as this would account for the total number of shipments.

These three line graphs show that a higher proportion of high distance shipments have low shipment weights. At approximately 50 pound shipment weight, the proportion of low distance shipments surpasses that of high distance shipments.

dist_ratio for exports and non-exports

The patterns of distance ratio behave pretty erratically. The only patterns that can be observed are the ratio decreasing as distance increases. The distance ratio for non Mexico/Canada exports also appear to be the highest of any group.

How does mode of transit change with shipment distance?

Now subset the data to only include the predominant modes of transport.

For-hire and Private Truck have the highest percentages of shipments until about 1,000 miles, where Parcel, USPS, or courier has the highest percentage of shipments.

How does mode of transit change with shipment weight?

Now subset the data to only include the predominany modes of transport.

Parcel, USPS, or Courier has the highest percentage of shipments up to about 70 pounds, and then falls precipitously to practically none of the shipments. For-Hire Truck and Private Truck have the highest percentage of shipments that weigh greater than 70 pounds. Air Travel has some packages up to about 70 pounds, and then falls to a very small percentage of packages.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The main findings relate to the mode of transport used: * Shipment Distance: For-hire and Private Truck have the highest percentages of shipments up about 1,000 miles, where Parcel, USPS, or courier has the highest percentage of shipments. * Shipment Weight: Parcel, USPS, or Courier has the highest percentage of shipments up to about 70 pounds, and then falls precipitously to practically none of the shipments. For-Hire Truck and Private Truck have the highest percentage of shipments that weighh greater than 70 pounds. Air Travel has some packages up to about 70 pounds, and then falls to a very small percentage of packages.

Were there any interesting or surprising interactions between features?

I was surprised the ratio between great circle distance and routed distance behaved so erratically with exports. This might relate to the location where shipments can legally enter each country.


Final Plots and Summary

Plot One

##   SHIPMT_ID ORIG_STATE ORIG_MA ORIG_CFS_AREA DEST_STATE DEST_MA
## 1    472066         48     206        48-206         48     206
##   DEST_CFS_AREA NAICS QUARTER SCTG MODE SHIPMT_VALUE SHIPMT_WGHT
## 1        48-206   336       3   37   11    521277800       91990
##   SHIPMT_DIST_GC SHIPMT_DIST_ROUTED TEMP_CNTL_YN EXPORT_YN EXPORT_CNTRY
## 1              7                  7            N         N            N
##   HAZMAT WGT_FACTOR dist_ratio
## 1      N        3.5          1

Description One

The distribution of shipment value is a right-skewed long-tailed distribution with the majority of shipments falling below $1,000. The maximum valued shipment of $521,277,800 was a 91,990 pound ‘Transportation Equipment Manufacturing’ shipment delivered 7 miles by Air from Dallas-Fortworth to Dallas-Fortworth.

Plot Two

Description Two

For-hire and Private Truck have the highest percentages of shipments for shipment distances up to about 1,000 miles, where Parcel, USPS, or courier has the highest percentage of shipments.

Plot Three

Description Three

Parcel, USPS, or Courier has the highest percentage of shipments up to about 70 pounds, and then falls precipitously to practically none of the shipments. For-Hire Truck and Private Truck have the highest percentage of shipments that weigh greater than 70 pounds. Air Travel has a low percentage of packages up to about 70 pounds, and then falls to a very small percentage of packages.


Reflection

The main conclusions drawn from the exploration of the Commodity Flow Survey dataset are: * The shipments with the most weight and value tend to be in the agriculture, transportation, and energy industries. * The heaviest shipments are transported mainly by rail, while the average shipments are transported mainly by road. * Air, water, and rail are the modes of transit used for the longest shipment distances. * For-hire and Private Truck have the highest percentages of shipments up about 1,000 miles, where Parcel, USPS, or courier has the highest percentage of shipments. * Parcel, USPS, or Courier has the highest percentage of shipments up to about 70 pounds, and then falls precipitously to practically none of the shipments. For-Hire Truck and Private Truck have the highest percentage of shipments that weighh greater than 70 pounds. Air Travel has some packages up to about 70 pounds, and then falls to a very small percentage of packages.

Some challenges with the analysis relate to the large size (4,547,661 observations). As a result of the size, a 10,000 observation sample was used to explore different resource-heavy plots, such as the pairplot.

I would like to further explore different regressions that could be conducted with different variables in the dataset. Can mode of transit be predicted by any set of variables?